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Abstract 

Background: The distance between two genomes is often computed by comparing only the common markers 
between them. Some approaches are also able to deal with non-common markers, allowing the insertion or the 
deletion of such markers. In these models, a deletion and a subsequent insertion that occur at the same position 
of the genome count for two sorting steps. 

Results: Here we propose a new model that sorts non-common markers with substitutions, which are more 
powerful operations that comprehend insertions and deletions. A deletion and an insertion that occur at the same 
position of the genome can be modeled as a substitution, counting for a single sorting step. 

Conclusions: Comparing genomes with unequal content, but without duplicated markers, we give a linear time 
algorithm to compute the genomic distance considering substitutions and double-cut-and-join (DCJ) operations. 
This model provides a parsimonious genomic distance to handle genomes free of duplicated markers, that is in 
practice a lower bound to the real genomic distances. The method could also be used to refine orthology 
assignments, since in some cases a substitution could actually correspond to an unannotated orthology. 



Background 

The genomic distance is often computed taking into 
consideration only the common markers, that occur in 
both genomes [1-3]. Approaches to deal with unique 
markers (that occur in only one genome) also exist, but 
usually allowing only insertions or deletions of these 
markers. Insertions and deletions can be shortly called 
indels. In [4], the operations allowed are inversions and 
indels, while the models given in [5] and [6] consider 
indels and the double cut and join (DCJ) operation [7], 
that is able to represent most large scale mutation 
events in genomes, such as inversions, translocations, 
fusions and fissions. The mentioned approaches assign 
the same weight to all rearrangement operations, includ- 
ing indels, regardless of the size of the affected regions 
and the particular types of the operations. A drawback 
in these models is that, if a deletion and a subsequent 
insertion occur at the same position of the genome, the 
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cost is the same as a deletion and an insertion in differ- 
ent positions. 

In the present work we propose a more parsimonious 
model in which, instead of deleting or inserting, we 
allow the substitution of unique markers between two 
genomes, as illustrated in Figure 1. We do not suggest 
that a substitution occurs in a precise moment in evolu- 
tion, but instead it represents a region that underwent 
continuous mutations (duplications, losses and gene 
mutations), so that a group of genes is transformed into 
a different group of genes (either of which may also be 
empty, allowing a substitution to represent an insertion 
or a deletion). Other studies also represent continuous 
mutations as a rearrangement event [8,9]. By minimizing 
substitutions we are able to establish a relation between 
indels that could have occurred in the same position of 
the compared genomes, identifying genomic regions that 
could be subject to these continuous mutations. Observe 
that we suggest that such regions have a common evolu- 
tionary origin. We develop a method to count the mini- 
mum number of substitutions that could have occurred, 
by assigning the same weight to substitutions and to the 
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Figure 1 (i) An optimal sorting scenario with DCJ operations and indels. (ii) An optimal sorting scenario with DCJ operations and indels in 
which the last two operations occur in the same position of the genome, between markers a and fa. (iii) A more parsimonious alternative to the 
deletion of consecutive markers 5 and u and the insertion of consecutive markers x and y would be the substitution of 5 and u by x and y. 



other operations, similarly to the approaches that handle 
indels. 

We analyze genomes with unequal content, but with- 
out duplicated markers and extend the results given in 
[6] to develop a linear time algorithm that exactly com- 
putes the genomic distance with substitutions and DCJ 
operations. The objective of this model is to provide a 
parsimonious genomic distance to handle genomes free 
of duplicated markers, that in practice is a lower bound 
to the real genomic distances. In the present work, we 
do not study algorithms to generate parsimonious sort- 
ing scenarios. Nevertheless, in the analysis of the evolu- 
tion of human chromosomes X and Y, we manually 
obtain a parsimonious evolutionary scenario under our 
model, that is coherent with the results given in [10]. 

In the remainder of this section we introduce some 
concepts given in [1] and [6] and define the operation 
that substitutes markers in a genome - these are the 
basis of the method that we will present here. 

Preliminaries 

In the present study duplicated markers are not allowed. 
Given two genomes A and B, possibly with unequal con- 
tent, we denote by Q the "reduced" genome [4], that is 
the set of markers that occur once in A and once in B. 
Moreover, the set A contains the markers that occur 
only in A and the set B contains the markers that 
occur only in B. The markers in sets A and B are 



also called unique markers. Observe that the sets Q , 
A and B are disjoint. 

A genome is possibly composed of linear and circular 
chromosomes. Each marker g in a genome is a DNA 
fragment and is represented by the symbol g, if it is read 
in direct orientation, or by the symbol g , if it is read in 
reverse orientation. An example of a pair of genomes is 
given in Figure 2. 

In the following we adopt definitions which we have 
given in [6] (some of them are generalizations of con- 
cepts introduced by Bergeron et al. [1]). 

Q -adjacencies 

Each one of the two ends of a linear chromosome is 
called a telomere and is represented by the symbol °. 
For each marker gs Q , denote its two extremities by g 
(tail) and g 1 (head). A Q -adjacency in genome A 
(respectively in genome B) is in general a linear string v 
- Jx^Jit such that J\ and y 2 ar e telomeres or extremities 
of markers of Q and I, the string composed of the 
markers that are between y 1 and y 2 m A (respectively in 
B), contains no marker that also belongs to Q . The 
string I is said to be the label of v, and the extremities 
Yi and y 2 ar e said to be Q -adjacent. If E is a non-empty 
string, v is said to be labeled, otherwise v is said to be 
clean. 

A Q -adjacency 71^/2 can also be represented by 
Q . Furthermore, °i° represents a linear chromosome 




B ' . > . . . . «- 

Figure 2 For genomes A, composed of three linear chromosomes, and 8, composed of one single chromosome, we have Q = {a, b, C, d, e} , 
A = {s, t, U, V, U)} and B = {x, y, z] ■ 
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composed only of markers that are not in Q . In the 
same way, a Q -adjacency given by a label I corre- 
sponds to a whole circular chromosome composed only 
of markers that are not in Q . This is the only case of a 
Q -adjacency in which we have a circular instead of a 
linear string. 

Two genomes A and B can then be represented by the 
sets Vg (A) and Vg (B) , containing their Q -adjacen- 
cies. For the two genomes in Figure 2, we have 
V g (B) = [oa 1 , a h xyb\ b h c\ c h zd\ d h e\ e h o}, 
Vg (B) = {oa\ cPxyb 1 , b h c l , c h zd t , d h i, e h o] and 
Vg (B) = {oa\ a h xyb x , b h c\ c h zd t , d h e\ e h o}. 

The DCJ operation 

A cut performed on a genome A separates two adjacent 
markers of A. A cut affects a Q -adjacency v of Vg (A) 
as follows: if v is linear, the cut is done between two 
symbols of v, creating two open ends in two separate 
linear strings; if v is circular, the cut creates two open 
ends in one linear string. A double-cut and join or DCJ 
applied on a genome A is the operation that generally 
performs two cuts in Vg (A) , creating four open ends, 
and joins these open ends in a different way. A DCJ 
operation can correspond to several rearrangement 
events, such as an inversion, a translocation, a fusion, or 
a fission [7]. 

We represent by ({yAlt^ , Ys £ 3 I ^Y2 } -» {yA| £ 2 
Y2» Y3 h\U Y4 }) a DCJ applied on Y1V4Y4 and y 3 ^2l2 » 
that creates Y1V2Y2 and Y3^3^4Y4- Observe that one or 
more extremities among y 1( y 2 , Y3 and 74 can be equal to 
0 (a telomere), as well as one or more labels among t lt 
l 2 , €3 and £ 4 can be equal to £ (the empty string). Parti- 
cular cases include circular adjacencies and are 
described in [6]. 

Adjacency graph and the DCJ distance 

The adjacency graph AG(A, B) [1] is the bipartite graph 
that has a vertex for each Q -adjacency in Vg (A) and 
a vertex for each Q -adjacency in Vg (B) . Then, for 
each g£ G , we have one edge connecting the vertex in 
Vg (A) and the vertex in Vg (B) that contain g 1 and 
one edge connecting the vertex in Vg (A) and the ver- 
tex in Vg (B) that contain g 1 . 



The connected components of the graph AG(A, B) are 
cycles and paths that alternate vertices in Vg (A) and 
Vg (B) . A path that has one endpoint in Vg (A) and 
the other in Vg (B) is called an AB-path. In the same 
way, both endpoints of an AA-path are in Vg (A) , as 
well as both endpoints of a BB-path are in Vg (B) . 
Furthermore, AG(A, B) can have two extra types of 
components: each Q -adjacency that corresponds to a 
linear (respect, circular) chromosome is a linear 
(respect, circular) singleton. Linear singletons are parti- 
cular cases of AA-paths and BB-paths. An example of 
an adjacency graph is given in Figure 3. 

The number of AB-paths in AG(A, B) is always even 
and a DCJ operation can be of three types [1,6]: optimal 
when it either increases the number of cycles by one, or 
the number of AB-paths by two; neutral when it does 
not affect the number of cycles and AB-paths; or coun- 
ter-optimal when it either decreases the number of 
cycles by one, or the number of AB-paths by two. 

Singletons, AB-paths composed of one single edge, 
and cycles composed of two edges are said to be DCJ- 
sorted. Longer paths and cycles are said to be DCJ- 
unsorted. The procedure of using DCJ operations to 
turn AG(A, B) into DCJ-sorted components is called 
DCJ-sorting of A into B. The DCJ distance of A and B, 
denoted by d DC j{A, B), corresponds to the minimum 
number of steps required to do a DCJ-sorting of A into 
B and can be easily obtained: 

Theorem 1 ( [l])Given two genomes A and B without 
duplicated markers, we have d DCj {A, B) = \Q | - c - -| , 
where Q is the set of common markers between A and B, 
and c and b are the number of cycles and of AB-paths 
in AG(A, B). 

Runs of unique markers 

Given a component C of AG (A, B), we can obtain a 
string £(C) by the concatenation of the labels of the 
Q -adjacencies of C in the order in which they appear. 
Cycles, AA-paths and BB-paths can be read in any 
direction, but AB-paths should always be read from A 
to B. If C is a cycle and has labels in both genomes A 
and B, we should start to read in a labeled Q -adjacency 
v of A, such that the first labeled vertex before v is a 
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Q -adjacency in B; otherwise C has labels in at most 
one genome and we can start anywhere. Each maximal 
substring of £{C) composed only of markers in A 
(respectively in B is called an A -run (respectively a 
B -run). Each A -run or B -run can be simply called 
run[6]. A component composed only of clean Q -adja- 
cencies has no run and is said to be clean, otherwise the 
component is labeled. We denote by A(C) the number 
of runs in a component C. A path can have any number 
of runs, while a cycle has zero, one, or an even number 
of runs. Figure 4 shows a BB-path with 4 runs. 

Substitutions 

The unique markers in A and B are represented in 
AG (A, B) as labels and singletons and, in order to sort 
A into B, they also have to be considered. Here we pro- 
pose a model in which only the following operation can 
be applied to unique markers. A substitution is an 
operation that affects the label of one single Q -adja- 
cency, by substituting contiguous markers in this label. 

Consider the labels £\ and £ 2 , where \£\\ = m and \£ 2 \ 
= n. The substitution of £\ by £ 2 in a Q -adjacency is 
represented by (7^3 |€ x \t^y 2 -> Yi^l^ih) (for better 
reading in our notation we omit the curly set brackets 
for singleton sets). One or both extremities among y 1 
and y 2 can be equal to 0 (a telomere), as well as one or 
both labels among £3 and £4 can be equal to e (the 
empty string). The substitution of £\ by £ 2 in a circular 
singleton is represented by (|£i|^3| — > \£ 2 \£3\). Observe 
that at most one chromosome can be entirely substi- 
tuted at once (but we do not allow the substitution of a 
linear by a circular chromosome and vice-versa). More- 
over, if m = 0, we have an insertion of n contiguous 
markers. On the other hand, if n = 0, we have a deletion 
of m contiguous markers. Thus, insertions and deletions, 
also called indels, are special cases of substitutions. 

The DCJ-substitution distance of A and B, denoted by 
dp C! (A, B) , is the minimum number of DCJs and sub- 
stitutions required to transform A into B. Since substitu- 
tions include indels, d S Q C j{A, B) is upper bounded by 
the DCJ-indel distance, the minimum number of DCJ 
and indel operations required to transform A into B, 
that can be computed in linear time [6]. In the present 
work we give an approach to exactly compute 
dpcjlA, B) also in linear time. 



Results and discussion 

The main result of the present study is an exact formula 
to compute the DCJ-substitution distance in linear time. 
We achieve this formula by developing the substitution- 
potential of two genomes, a property that allows us to 
obtain a good upper bound to the genomic distance 
with DCJ operations and substitutions. Then we show 
how some special DCJ operations reduce the overall 
number of substitutions and obtain the exact formula. 
Although the objective of this model is to provide a par- 
simonious genomic distance, that in practice is a lower 
bound to real distances, we run some experiments on 
data from human X and Y chromosomes and obtained a 
parsimonious sorting scenario that is coherent with the 
results available in the literature. We also observe that 
the DCJ-substitution method could be used to refine 
orthology assignments. 

The substitution-potential 

Observe that a Q -adjacency with a non-empty label £ 
can be cut in at least two different positions, either 
before or after £. Since the position of the cut does not 
change the effect of the DCJ on <i DC j(A, B), we can 
choose to cut at positions that allow the concatenation 
of the labels of the original Q -adjacencies. As a conse- 
quence, a set of labels of one genome can be accumu- 
lated with DCJ operations. In particular, when we apply 
optimal DCJs on only one component of the adjacency 
graph, we can accumulate an entire run in a single 
Q -adjacency: 

Proposition 1 ( [6])A run can be entirely accumulated 
in the label of one single Q -adjacency with optimal DCJ 
operations. 

Given a DCJ operation p, let A 0 and Ai be, respec- 
tively, the number of runs in AG (A, B) before and after 
p. We define AA(p) = A r A 0 . 

Proposition 2 ( [6])Given any DCJ operation p, we 
have AA(p) > - 2. 

In order to obtain the exact formula for the DCJ-sub- 
stitution distance, we will first analyze the components 
of the adjacency graph separately. Given two genomes A 
and B and a component C e AG (A, B), we denote by 
d DC /(C) the minimum number of DCJ operations 
required to do a separate DCJ-sorting in C, applying 



(2 



h U 4 

V v ' >> » ' V V 

.A-run s-run .A-run B-run 

Figure 4 A BS-path with 4 runs. Only the labels of the Q -adjacencies are represented. 
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DCJs on vertices of C (or vertices that result from DCJs 
applied on vertices that were in C). It is possible to do a 
separate DCJ-sorting using only optimal DCJs in any 
component of AG {A, B), thus, in other words, d DC j{A, 
B) = E c ^AG{A,B)d D cj{C) [2]. In [6] we have already 
defined the indel-potential of a component, denoted by 
1(C), that is the minimum number of runs that we can 
obtain by DCJ-sorting C with optimal DCJ operations 
only, and can be computed with the formula given in 
the next proposition. 

Proposition 3 ( [6])Given a component C in AG(A, B), 
we haveX{C) = ^± , if HQ > 1. Otherwise X(Q = 0. 

Similarly, here we denote by <r(C) the substitution- 
potential of a component C, that is the minimum num- 
ber of substitutions that we can obtain by DCJ-sorting C 
with optimal DCJ operations only. In order to find a for- 
mula to compute (7(C), we first obtain a stronger version 
of Proposition 1 where not only the labels of a run are 
accumulated into a single Q -adjacency, but pairs of 
consecutive runs are accumulated into adjacent Q -adja- 
cencies (that are Q -adjacencies connected by a single 
edge in the adjacency graph). 

Proposition 4 ( [6])If Yij 2 is a clean Q -adjacency in a 
DCJ-unsorted component C of AG (A, B), such that 
neither y 1 nor y 2 are telomeres, then it is always possible 
to extract a clean cycle from C with an optimal DC] 
operation. 

Proposition 5 Two consecutive runs in a component C 
can be entirely accumulated into the labels of two 
adjacent Q -adjacencies of C with optimal DCJs. 

Proof: By Proposition 1 we assume that two consecu- 
tive runs of C are accumulated into Q -adjacencies v A 
and v B . If v A and v B are not adjacent, there are only 
clean Q -adjacencies between v A and v B in C. By Propo- 
sition 4, we can apply optimal DCJs to extract clean 
cycles until v A and v B are adjacent. 

Pairs of consecutive runs that are accumulated into 
adjacent Q -adjacencies can be extracted into a labeled 
DCJ-sorted component, that can be sorted with one 
substitution. Observe that minimizing the number of 
pairs of consecutive runs is equivalent to minimizing 
the total number of runs. Hence, we can determine the 
substitution-potential from the indel-potential. 

Proposition 6 jGiv^n a component C in AG {A, B), we 
havea{C) 



to the number of pairs of labeled adjacent Q -adjacen- 
cies, which is: 



A(C)+1 



, if HQ s 1. Otherwise o(Q = 0. 
Proof: By Proposition 5 we can assume that the runs 
of C are accumulated into pairs of adjacent Q -adjacen- 
cies. By Proposition 3, we can obtain M c ) +1 ~| runs 
doing a separate DCJ-sorting in C with optimal DCJs. 
Moreover, these optimal DCJs can be done in such a 
way that pairs of runs that were accumulated into adja- 
cent Q -adjacencies remain in these adjacent Q -adja- 
cencies. Since each one of these pairs can be sorted with 
one substitution, the substitution-potential of C is equal 



CT(C) = 



A(C)+1 



The formulas to compute X(Q and <t(C), given in Pro- 
positions 3 and 6 above, are indeed very similar. Conse- 
quently, many of the results obtained in [6] can be 
adapted to the new substitution-potential. Let <7 0 and <7i 
be, respectively, the sums of the number a for the com- 
ponents of the adjacency graph before and after a DCJ 
operation p. We then define Aa{p) = a t - a 0 . Further- 
more, let A dc j{p) be respectively 0, +1 and +2 depending 
whether p is optimal, neutral or counter-optimal. We 
also define Ad(p) = A dcj (p) + Aa{p). 

Proposition 7 Given a DCJ operation p acting on a 
single component, we have Ad(p) > + 2 if p is counter- 
optimal, or Ad{p) > 0 if p is neutral. 

We denote by d^ C! {C) the minimum number of 
DCJs and substitutions required to sort separately a 
component C of AG {A, B). The definition of <7 and Pro- 
position 7 guarantee that d s ^ CJ (C) = d DCj {C) + o{C) . 

Observe that, if C is a singleton in the adjacency 
graph, dp C j(C) = 1, corresponding to the insertion or 
the deletion of the whole chromosome. We do not 
allow the substitution of a linear by a circular singleton 
and vice-versa. However, each pair composed by a sin- 
gleton in genome A and a singleton in genome B (such 
that both are linear or both are circular) can be sorted 
with one single substitution, which saves one sorting 
step per pair. Let P L and P c be, respectively, the maxi- 
mum number of disjoint pairs of linear and circular sin- 
gletons in the adjacency graph. Together with the DCJ- 
substitution distance per component, these numbers 
give a good upper bound for d S Q C j{A, B): 

Lemma 1 Given two genomes A and B without dupli- 
cated markers, we have: 

d$ CJ {A,B)<d DC ,{A,B)+ ^ °{C)-P L -P C . 

CeAG{A,B) 

The formula given by Lemma 1 above corresponds to 
the exact distance for a particular set of genomes. Given 
a Q -adjacency y£° of a genome A such that y*°, then 
y is said to be a tail of a linear chromosome in A. Two 
genomes are co-tailed if their sets of tails are equal (this 
includes two genomes composed only of circular 
chromosomes). 

Theorem 2 Given two co-tailed genomes A and B 
without duplicated markers, we have: 

d$ cl {A,B) = d DC ,(A, B) + X CE AG(A, B) a(C) ~ Pl ~ Pc - 
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However, for non co-tailed genomes the use of DCJs 
applied to two components of the adjacency graph can 
lead to a shorter sequence of operations sorting one 
genome into another, as we will see in the next section. 

The DCJ-substitution distance 

Recall that Aa(p) = a x - ct 0 , where CT 0 and CTi are the 
sums of the number a for the components of the adja- 
cency graph before and after p. A DCJ operation p that 
acts on two components of the adjacency graph is called 
recombination. 

Proposition 8 Given any recombination p, we have 
Aaip) > -2. 

Proof: Only the recombinations that decrease or do 
not change the number of runs (AA < 0) have to be 
analyzed (we can not have Act < -1 if the number of 
runs increases). Consider the recombination of two 
paths with i and runs, that result in two new paths 
with i' and /' runs. The best we can have is when i and / 
are multiples of 4, V and /' are multiples of 4 minus 1 
and AA = -2, that gives: 

^=\fV\^] = ^ = ^ = i = i = \f]-^\^=-o-2. 

The analysis of recombinations involving cycles is 
analogous. 

All recombinations involving at least one cycle are 
counter-optimal and any counter-optimal recombination 
has Ad > 0, thus only path recombinations can have Ad 
< -1. The following definitions are similar to those 
given in [6], except that here we have a larger number 
of labeled path types. 

Consider an integer i > 0. For a second integer k e {1, 
3}, let A + k (respectively B + k ) be a sequence with 
odd 4/ + k runs, starting and ending with an A -run 
(respectively B -run). Similarly for k e {2, 4}, let 
AB +k (respectively BA +k), be a sequence with 
even M + k runs, starting with an A -run (respectively 
B -run) and ending with a B -run (respectively 
A -run). An empty sequence (with no run) is repre- 
sented by e. Then each one of the notations AA E , 
AA B+l , AA B+l , AA AB+2 , AA B+3 , AA AB+4 , 

BB f: , BB A+1 , BB B+1 , BB AB+2 , BB A+3 , BB B+3 , 
BB AB+4> AB e , AB A+1 , AB B+l , AB AB+2 , AB BA+2 , 
AB B+3 , AB B +3 , AB AB+i and AB BA+4 represents a 
particular type of path (AA, BB or AB) with a particular 
structure of runs (e, A + 1 , B + 1 , AB + 2 , BA + 2 , 
B + 3 , B + 3 , AB +4, or BA + 4 ). 

The components on which the cuts are applied are 
called sources and the components obtained after the 
joinings are called resultants of the recombination. The 
complete set of recombinations with Ad < -1 is given in 
Table 1. In Table 2 we also list recombinations with Ad 
= 0 that create at least one source of recombinations of 
Table 1. We denote by ♦ an AB-path that can not be a 



Table 1 Path recombinations that have Ad < -1 and 
allow the best reuse of the resultants. 



sources 


resultants 


Acr 


A dcj Ad 


AA AB+A + BBAB+4 


• + • 


-2 


0 -2 


AA Aft ■ 4 + AA AK +4 


AA A+3 + AA B + 3 


-2 


+1 -1 


B^AB +4 + B&AB +4 


BB A+3 + BB B+3 


-2 


+1 -1 


AA AK +4 + AB Aft ,4 




-2 


+1 -1 


+ AB BA+i 


• + AA B+3 


-2 


+1 -1 


BB AB +4 + AB AB +4 


• + BB B+3 


-2 


+1 -1 


BB AB+4 + AB BA+4 


• + BB A+3 


-2 


+1 -1 


AAa+1 + BB AB+4 


• + AB AB + 4 


-1 


0 -1 


AAb+i + BB AB+4 


• + AB BA+4 


-1 


0 -1 


AA-AB+4 + BB A+1 


• + AB BA+4 


-1 


0 -1 


AAab+4 + BB B+1 


• + AB AB + 4 


-1 


0 -1 


AA-AB +2 + BB AB +4 


• + • 


-1 


0 -1 


AA AB +4 + BB AB +2 


• + • 


-1 


0 -1 


AA-AB +2 + BB AB +2 


• + • 


-1 


0 -1 


AAa.o + BB A B ,A 


• + • 


_1 


0 -1 


AA. B+i + BB AB+i 


• + • 


-1 


0 -1 


AAaB+4 + BB A+1 


• + • 


-1 


0 -1 


AA AB+A +BB B+3 


• + • 


-1 


0 -1 


AA A+l + BB A+l 


• + • 


_1 


0 -1 


AA B+1 + BB B+1 


• + • 




0 -1 


^.4+1 + BB AB+2 


• + • 




0 -1 


AA B+ i + BB AB+2 


• + • 




0 -1 


^.48+2 + BB A+\ 


• + • 




0 -1 


AA AB +2 + BB B +1 


• + • 




0 -1 


^^4+1 + BB A+3 


• + • 




0 -1 


AA-B+l + BB B+3 


• + • 




0 -1 


AA A+3 + BB A+l 


• + • 




0 -1 


AA B+3 + BB B+1 






0 -1 


AB AB+i +AB BA+4 ■ + ■ -2+1-1 



source in Tables 1 and 2, such as AB F , AB A+1 , AB B+1 , 

AB BA+2' AB BA+2> AB A+3 an d AB B +3 • 

Proposition 9 The recombinations with Ad = 0 invol- 
ving cycles or circular singletons cannot create new com- 
ponents that can be used as sources of recombinations 
listed in Tables 1 and 2. 

The two sources of a recombination can also be called 
partners. Looking at Table 1 we observe that some types 
of paths have more partners than other types of paths. 
For example, all partners of AB AB+4 and AB BA+4 
paths are also partners of AA AB+4 and BB AB+4 paths. 
Furthermore, some resultants of recombinations in 
Tables 1 and 2 can be used in other recombinations. 
These observations allow the identification of groups of 
recombinations, as listed in Table 3. 

The deductions shown in Table 3 can be computed 
with an approach that greedily maximizes the number 
of recombinations in U, V, W, X, Y and Z in this order. 
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Table 2 Recombinations that have Ad = 0 and create 
resultants that can be used in recombinations with Ad < 
-1 (listed in Table 1). 



sources 


resultants 


Acr 




Ad 


AB AB +4 + AB AB +4 


^.4 +3 + BB B+3 


-2 


+2 


0 


^^+1 + AB BA+i Cj 


• + AA AB +4 


-1 


+1 


0 


AA B +1 + AB AB +4 


• + AA AB+i cf 


-1 


+1 


0 


BB A+\ + AB AB+4 


• + BB .4B +4 


-1 


+1 


0 


BB B+\ + AB 8.4+4 


* + BB .4B +4 


-1 


+1 


0 


AB B.4+4 + AB B.4+4 


^B+3 + BB A+3 


-2 


+2 


0 


AA AB +2 + ^ B .4B +4 


• + AA A+1 


-1 


+1 


0 


AA-AB+l + ^ B B.4+4 


• + AA B+1 


-1 


+1 


0 


BB AB+2 + AB B.4+4 


• + BB A+1 


-1 


+1 


0 


BB .4B +2 + AB .4B +4 


• + BB B+1 


-1 


+1 


0 



Table 3 All recombination groups obtained from Tables 1 
and 2 (the recombinations from Table 2 appear only in 
groups in Y and Z). The column scr indicates the 
contribution of each path in the distance decrease. 





sources 


resultants 


Ad 


scr 


u 


AA-AB +4 + BB AB +4 


2- 


-2 


-1 


V 


2A A AB +4 + BB A+1 + BB B +1 


4- 


-3 


-3/4 




2BB AB +4 + ^.4+1 + AA B +1 


4- 


-3 


-3/4 


w 


AA AB +4 + BB A +1 + AB_4 B +4 


3- 


-2 


-2/3 




^.46+4 + BB B+l + AB B.4+4 


3- 


-2 


-2/3 




BB .4B+4 + AA A+l + AB BA+A 


3- 


-2 


-2/3 




BB AB +4 + AA B +1 + AB AB +4 


3- 


-2 


-2/3 




2AA .4B+4 + 


2.+A4 s+3 


-2 


-2/3 




2A A AB+4 + BB B+1 




-2 


-2/3 




2BB AB+4 + ^.4+1 


2.+BB B+3 


-2 


-2/3 




2BB AB+4 + ^B+l 


2.+BB^ +3 


-2 


-2/3 


X 


Recombinations from Table 1 with A/ 




-1 


-1/2 



= -1 



Y 2 AB AB+4 +AA B+1 + BB A+1 4- -2 -1/2 

2AB BA+4 + AA A+1 + BB B+1 4- -2 -1/2 

1/3 
1/3 
1/3 
1/3 
1/3 
1/3 
1/3 
1/3 
1/3 
1/3 
1/3 
1/3 
1/3 
1/3 



The U part contains only one operation and the two 
groups in V are mutually exclusive after applying U. 
The part W is then the application of all possible 
remaining groups of two operations with Ad = -2. Simi- 
larly, the part X is only the application of all possible 
remaining operations with Ad = -1. After X, the two 
groups in Y are mutually exclusive and then the same 
happens to the groups in Z. Although some groups in 
W, X and Z have some reusable resultants, those are 
actually never reused (if operations that are lower in the 
table use as sources resultants from higher operations, 
the sources of all referred operations would be pre- 
viously consumed in operations that occupy even higher 
positions in the table). Due to this fact, the number of 
operations in U, V , W, X, Y and Z depends only on the 
initial number of each type of component. 

With the results presented in this section we have an 
exact formula to compute the DCJ-substitution distance: 

Theorem 3 Given two genomes Aand Bwithout dupli- 
cated markers, we have: 

dg CJ (A,B) = d DC ,(A,B)+ £ a(C )-P L -P c -2U-3V-2W-X-2Y-Z, 

CeAGU.B) 

where P L and P c are the numbers of disjoint pairs of 
linear and circular singletons and U, V, W, X, Y and Z 
are computed as described above. 

The formula given in Theorem 3 is analogous to the 
one which we have obtained in [6] to compute the DCJ- 
indel distance. Both formulas depend on factors that 
can be computed in linear time [6]. 
Triangular inequality 

Note that, since only unique markers can be substituted 
in this model, we avoid the "free lunch problem", men- 
tioned in [5], that is the possibility of transforming any 
genome A into any genome B by simply substituting the 
whole content of A by the whole content of B. However, 
the triangular inequality can be disrupted in the DCJ- 
substitution distance. In other words, given any three 
genomes A, B and C without duplicated markers, there 
is no guarantee that the triangular inequality 
d$ c ,{A, B) < d$ C] {A, c) + da Q {B, C) holds. In a compa- 
nion paper [11] we provide an efficient way to establish 
the triangular inequality a posteriori in both the DCJ- 
indel [6] and the DCJ-substitution distances. 

Experiments 

The objective of this model is to provide a parsimonious 
genomic distance, that in practice is a lower bound to 
real distances. Nevertheless, we could run some experi- 
ments on data from human X and Y chromosomes and 
obtained a parsimonious sorting scenario that is 



ab ab +4 + AA AB +2 + BB A+3 
A b ba+i + AA AB+2 + BB B+3 
A B BA+i + ^.4+3 + BB AB+2 
AB AB +4 + AA B +3 + BB^g +2 
AB AB+4 + AA B+1 + BB_4 +3 



AB AB +4 + AA 



B+3 



+ BB 



.4+1 



AB BA+i +AA A+1 +BB 



B+3 

A B BA+4 + AA.A+3 + BB B+1 



AB AB +4 + AA 



B+l 



+ BB 



AB BA+4 +AA A+1 +BB 
2AB ab +4 + AA B +1 

2AB AB+4 + BB A+1 

2AB BA+4 + AA A+1 

2AB BA+4 + BB B+1 



A+l 
B+l 



2»+AB BA+4 
2»+AB AB+4 
2»+AA A+3 



2»+BB 



B+3 



2»+AA 



B+3 



2 •+ BB A + 3 
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coherent with the results available in the literature. Dur- 
ing evolution, a portion of the human Y chromosome 
has become increasingly subjected to local mutations, 
while the X chromosome remained relatively conserved, 
as we will see in the following. Human X and Y chro- 
mosomes are very different and, while X is 155 Mbp 
long, the Y chromosome is 58 Mbp long. However, they 
still share pseudo-autosomal regions at both extremities 
and are believed to have evolved from an identical auto- 
somal pair [12] (the autosomes are all non-sex chromo- 
somes). Current theories suggest that the pseudo- 
autosomal region, which originally covered the whole 
chromosomes, was successively pruned by a few big 
inversions on the Y chromosome [13] (we call these 
inversions pruning). After each pruning inversion, sev- 
eral mutations seem to have occurred on the affected 
part of the Y chromosome, while X remained "closer" to 
the common ancestor. 

A parsimonious scenario of 8 inversions on the mar- 
kers common to chromosomes X and Y has been pub- 
lished in [[10], Fig. 7], and is given as an argument to 
support the existence and bounds of the three most 
recent pruning inversions, but unique markers were 
simply ignored. We used our method to compute the 
DCJ-substitution distance using the same dataset, but 
reincorporating the unique markers, and obtained a 
DCJ-substitution distance of 14. Then we manually 
reconstructed the evolutionary scenario of human chro- 
mosomes X and Y and obtained a parsimonious sce- 
nario with 8 inversions and 6 substitutions (including 2 
insertions and 1 deletion) that is coherent with the 
pruning inversions given in [10] (see Figure 5). Although 
a DCJ is a very comprehensive operation and can repre- 
sent many rearrangement events, in the analysis of 



unichromosomal genomes DCJs often represent only 
inversions, and this also happens in this dataset. 

Discussion 

Our method was designed to find gene mutations, but it 
could also help to improve orthology assignments, that 
are the computational prediction of orthologous pairs of 
genes from different species. No orthology predictor is 
able to find all assignments correctly. In particular, 
when comparing two different species, some pairs of 
orthologous genes that are below the predictor thresh- 
old remain unassigned. Since our substitutions establish 
a relation between different genes in the two compared 
genomes, they correspond to candidates to be assigned 
as orthologous genes. 

Conclusions and future work 

In this work we presented a new model to compare two 
genomes with unequal content, but without duplicated 
markers, using substitutions and DCJ operations, and 
developed a linear time algorithm to exactly compute 
the DCJ-substitution distance. 

Although the objective of this model is to provide a 
parsimonious genomic distance, that in practice is a 
lower bound to real distances, based on our method we 
have manually reconstructed a parsimonious evolution- 
ary scenario of human chromosomes X and Y. We con- 
sidered biological constraints that are specific to this 
case and obtained a scenario that is coherent with the 
results given in the literature. 

By reconstructing a parsimonious scenario that mini- 
mizes substitutions, we may identify genomic regions 
that were subject to continuous mutations during evolu- 
tion and could have a common evolutionary origin. 



P 1 2 3 4 5 6 7 8 xi 9 x 2 10 x 3 |11 x 4 12 x 5 \ 

_J _ 

P 1 2 3 4 5 6 7 8 xi 9 x-2 10 x 3 : xE 12|x7 111 

I 

P 1 2 3 4 5 6 7 8 xi 9 x 2 10 xyxj 12 ll|x 4 l 

I 

P 1 2 | 3 4 5 6 7 8 xi 9 x 2 10 x :i x£ 12 ll | m V7 

_ _i 
P 1 2 : 11 12 x s xi 10 xi 9 xl 8 7 6 5 4|3 yE\y 7 

_ _ _ I 

P 1 2-11 12\x 5 xi\l0 xi 9 xl 8 7 6 5 4 y 6 3 y 7 

_ I _ 
P 1 2 : 11 12 yl y s 10 xi 9 xl 8 7|6 5|4 y 6 3 y 7 

I 



P 1 2 : 11 12 yl y 3 10 xi 9 xl 8 7 5 6 4 y 6 3 y 7 

_ _ - __ * _ 

P 1 2 : 11 12 yl y 3 10 xi 9\xl 8|7 5 6 y 5 4 y 6 3 y 7 

I 

P 1 2 : 11 12 yl i/3 10 xi 9 8 In 7 5|6 y 5 4 y 6 3 y 7 

I 

P 1 2 : 11 12 yl j/3 10 xi 9 8 5 7|xT|6 y 5 4 y 6 3 y 7 

_ _ _ _ I 

P 1 2 : 11 12 yl 2/3 W\xi\9 8 5 7 y 4 6 y 5 4 y 6 3 y 7 

_ _J_ 

P | 1 2 11 12 27T | 2/3 10 9 8 5 7 j/ 4 6 y 5 4 y 6 3 y 7 

P;yi 12 11 2|l 2/3 10 9 8 5 7 j/ 4 6 y 5 4 2/6 3 y 7 



P;yi 12 11 2 2/2 1 2/3 10 9 8 5 7 2/ 4 6 y 5 4 2/6 3 y 7 

Figure 5 A parsimonious scenario of 8 inversions and 6 substitutions (including 2 insertions and 1 deletion) sorting human X into Y 
chromosome, using the dataset given in [10]. The symbol 'P' represents the current pseudo-autosomal region in the beginning of X and Y. Each 
number represents a common marker, each symbol x,- represents a unique marker in X and each symbol y, represents a unique marker in Y (the 
unique markers were also obtained from the data in [10]). The three pruning inversions suggested in [[10], Fig. 7] are underlined. The boundary 
of the pseudo-autosomal region, indicated with vertical dots, is shifted to the left after each pruning inversion. 
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Currently our method is only able to compute the geno- 
mic distance, but in a future work we intend to study 
the space of all parsimonious sorting scenarios and 
develop methods to systematically identify such regions. 

The DCJ-substitution model could also be used to 
refine orthology assignments, since in some cases a sub- 
stitution could actually correspond to an unannotated 
orthology. We also plan on exploring the use of our 
method in refining orthology in a future work. 
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